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Abstract 

Background: First-generation molecular profiles for human breast cancers have enabled the identification of 
features that can predict therapeutic response; however, little is known about how the various data types can best 
be combined to yield optimal predictors. Collections of breast cancer cell lines mirror many aspects of breast 
cancer molecular pathobiology, and measurements of their omic and biological therapeutic responses are 
well-suited for development of strategies to identify the most predictive molecular feature sets. 

Results: We used least squares-support vector machines and random forest algorithms to identify molecular 
features associated with responses of a collection of 70 breast cancer cell lines to 90 experimental or approved 
therapeutic agents. The datasets analyzed included measurements of copy number aberrations, mutations, gene 
and isoform expression, promoter methylation and protein expression. Transcriptional subtype contributed strongly 
to response predictors for 25% of compounds, and adding other molecular data types improved prediction for 
65%. No single molecular dataset consistently out-performed the others, suggesting that therapeutic response is 
mediated at multiple levels in the genome. Response predictors were developed and applied to TCGA data, and 
were found to be present in subsets of those patient samples. 

Conclusions: These results suggest that matching patients to treatments based on transcriptional subtype will 
improve response rates, and inclusion of additional features from other profiling data types may provide additional 
benefit. Further, we suggest a systems biology strategy for guiding clinical trials so that patient cohorts most likely 
to respond to new therapies may be more efficiently identified. 



Background 

Breast cancer is a clinically and genomically heteroge- 
neous disease. Six subtypes were defined approximately 
a decade ago based on transcriptional characteristics and 
were designated luminal A, luminal B, ERBB2-enriched, 
basal-like, claudin-low and normal-like [1,2]. New cancers 
can be assigned to these subtypes using a 50-gene tran- 
scriptional signature designated the PAM50 [1]. However, 
the number of distinct subtypes is increasing steadily as 
multiple data types are integrated. Integration of genome 
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copy number and transcriptional profiles defines 10 
subtypes [3], and adding mutation status [4], methylation 
pattern [5], pattern of splice variants [6], protein and 
phosphoprotein expression [7] and microRNA expression 
and pathway activity [8] may define still more subtypes. 
The Cancer Genome Atlas (TCGA) project and other 
international genomics efforts were founded to improve 
our understanding of the molecular landscapes of most 
major tumor types with the ultimate goal of increasing 
the precision with which individual cancers are man- 
aged. One application of these data is to identify mo- 
lecular signatures that can be used to assign specific 
treatment to individual patients. However, strategies to 
develop optimal predictive marker sets are still being 
explored. Indeed, it is not yet clear which molecular 
data types (genome, transcriptome, proteome, and so on) 
will be most useful as response predictors. 
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In breast cancer, cell lines mirror many of the molecular 
characteristics of the tumors from which they were derived, 
and are therefore a useful preclinical model in which to ex- 
plore strategies for predictive marker development [8,9]. To 
this end, we have analyzed the responses of 70 well charac- 
terized breast cancer cell lines to 90 compounds and used 
two independent machine learning approaches to identify 
pretreatment molecular features that are strongly associated 
with responses within the cell line panel. For most com- 
pounds tested, in vitro cell line systems provide the only 
experimental data that can be used to identify predictive 
response signatures, as most of the compounds have not 
been tested in clinical trials. Our study focuses on breast 
cancer [10,11] and extends earlier efforts [12-14], by includ- 
ing more cell lines, by evaluating a larger number of com- 
pounds relevant to breast cancer, and by increasing the 
molecular data types used for predictor development. Data 
types used for correlative analysis include pretreatment 
measurements of mRNA expression, genome copy number, 
protein expression, promoter methylation, gene mutation, 
and transcriptome sequence (RNAseq). This compendium 
of data is now available to the community as a resource for 
further studies of breast cancer and the inter-relationships 
between data types. We report here on initial machine 
learning-based methods to identify correlations between 
these molecular features and drug response. In the process, 
we assessed the utility of individual data sets and the inte- 
grated data set for response predictor development. We 
also describe a publicly available software package that we 
developed to predict compound efficacy in individual tu- 
mors based on their omic features. This tool could be used 
to assign an experimental compound to individual patients 
in marker-guided trials, and serves as a model for how to 
assign approved drugs to individual patients in the clinical 
setting. We explored the performance of the predictors by 
using it to assign compounds to 306 TCGA samples based 
on their molecular profiles. 

Results and discussion 

Breast cancer cell line panel 

We assembled a collection of 84 breast cancer cell lines 
composed of 35 luminal, 27 basal, 10 claudin-low, 7 
normal-like, 2 matched normal cell lines, and 3 of unknown 
subtype (Additional file 1) [8]. Fourteen luminal and 7 basal 
cell lines were also ERBB2-amplified. Seventy cell lines were 
tested for response to 138 compounds by growth inhibition 
assays. The cells were treated in triplicate with nine dif- 
ferent concentrations of each compound as previously 
described [8]. The concentration required to inhibit growth 
by 50% (GI 50 ) was used as the response measure for each 
compound. Compounds with low variation in response in 
the cell line panel were eliminated, leaving a response data 
set of 90 compounds. An overview of the 70 cell lines with 
subtype information and 90 therapeutic compounds with 



GI 50 values is provided in Additional file 1. All 70 lines were 
used in development of at least some predictors depending 
on data type availability. The therapeutic compounds 
include conventional cytotoxic agents such as taxanes, 
platinols and anthracyclines, as well as targeted agents such 
as hormone and kinase inhibitors. Some of the agents 
target the same protein or share common molecular 
mechanisms of action. Responses to compounds with 
common mechanisms of action were highly correlated, 
as has been described previously [8]. 

A rich and multi-omic molecular profiling dataset 

Seven pretreatment molecular profiling data sets were 
analyzed to identify molecular features associated with 
response. These included profiles for DNA copy number 
(Affymetrix SNP6 - EGA accessions EGAS00000000059 
and EGAS00001000585), mRNA expression (Affymetrix 
U133A and Exon 1.0 ST array - ArrayExpress accessions 
E-TABM-157 and E-MTAB-181), transcriptome sequence 
(RNAseq - Gene Expression Omnibus (GEO) accession 
GSE48216), promoter methylation (Illumina Methylation27 
BeadChip - GEO accession GSE42944), protein abundance 
(Reverse Protein Lysate Array - Additional file 2), and mu- 
tation status (Exome-Seq - GEO accession GSE48216). The 
data were preprocessed as described in Supplementary 
Methods of Additional file 3. Figure SI in Additional file 3 
gives an overview of the number of features per data set 
before and after filtering based on variance and signal 
detection above background where applicable. Exome-seq 
data were available for 75 cell lines, followed by SNP6 data 
for 74 cell lines, therapeutic response data for 70, RNAseq 
for 56, exon array for 56, Reverse Phase Protein Array 
(RPPA) for 49, methylation for 47, and U133A expression 
array data for 46 cell lines. Information on the overlap in 
cell lines with both response data and molecular data is 
provided in Additional file 3. The set of 48 core cell lines 
was defined as those with response data and at least 4 mo- 
lecular data sets. 

Inter-data relationships 

We investigated the association between expression, copy 
number and methylation data. We distinguished correlation 
at the cell line level and gene level. At the cell line level, we 
report average correlation between datasets for each cell 
line across all genes, while correlation at the gene level rep- 
resents the average correlation between datasets for each 
gene across all cell lines. Correlation among the three ex- 
pression datasets (U133A, exon array, and RNAseq) ranged 
from 0.6 to 0.77 at the cell line level, and from 0.58 to 0.71 
at the gene level. Promoter methylation and gene expres- 
sion were, on average, negatively correlated as expected, 
with correlation ranging from -0.16 to -0.25 at the cell line 
level and -0.10 to -0.15 at the gene level. Across the gen- 
ome, copy number and gene expression were positively 
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correlated (0.18 to 0.22 at the cell line level; 0.35 to 0.44 at 
the gene level). When restricted to copy number aberra- 
tions, 22 to 39% of genes in the aberrant regions showed a 
significant concordance between their genomic and tran- 
scriptomic profiles from U133A, exon array and RNAseq 
after multiple testing correction (see the Intra-data rela- 
tionships' section in Supplementary Results in Additional 
file 3 and Table S4a-c in Additional file 3). 

Machine learning approaches identify accurate cell 
line-derived response signatures 

We developed candidate response signatures by analyzing 
associations between biological responses to therapy 
and pretreatment omic signatures. We used the inte- 
grative approach displayed in Figure 1 for the con- 
struction of compound sensitivity signatures. Standard 
data pre-processing methods were applied to each dataset. 
Classification signatures for response were developed 
using the weighted least squares support vector ma- 
chine (LS-SVM) [15] in combination with a grid search 
approach for feature optimization, as well as random for- 
ests (RF) [16], both described in detail in the Supplemen- 
tary Methods in Additional file 3. For this, the cell lines 
were divided into a sensitive and resistant group for each 
compound using the mean GI 50 value for that compound 
(Additional file 4). This seemed most reasonable after man- 
ual inspection, with concordant results obtained using TGI 
(concentration required to achieve total growth inhibition) 
as response measure. Multiple random divisions of the 
cell lines into two-thirds training and one-third test sets 
were performed for both methods, and area under a re- 
ceiver operating characteristic curve (AUC) was calcu- 
lated as an estimate of accuracy (Additional file 3). 

The candidate signatures incorporated copy number, 
methylation, transcription and/or proteomic features. We 
also included the mutation status of TP53, PIK3CA, MLL3, 
CDH1, MAP2K4, PTEN and NCOR1, chosen based on re- 
ported frequencies from TCGA breast project. That 
project sequenced the exomes of 507 breast invasive 
carcinomas and identified approximately 30,000 som- 
atic mutations [4]. Each of the 7 genes was mutated in 
at least 3% of samples with a false discovery rate (FDR) 
P-value <0.05. Our whole exome sequencing showed that 
these genes were also mutated in at least 3% of the 
breast cancer cell lines. Their mutation rate in TCGA 
and the cell line panel showed a similar distribution 
across the subtypes (Figure S2 in Additional file 3). We 
excluded lower prevalence mutations because their 
low frequency limits the possibility of significant 
associations. 

These signatures incorporating any of the molecular fea- 
tures are shown in Additional file 5. They predicted com- 
pound response within the cell lines with high estimated 
accuracy (AUC >0.70) regardless of classification method 



for 51 (57%) of the compounds tested. Concordance be- 
tween GI 50 and TGI exceeded 80% for 67% (34/51) of these 
compounds. A comparison across all 90 compounds of the 
LS-SVM and RF models with highest AUC based on copy 
number, methylation, transcription and/or proteomic fea- 
tures revealed a high correlation between both classification 
methods (Spearman correlation coefficient = 0.85, P-value 
<0.001), with the LS-SVM more predictive for 35 com- 
pounds and RF for 55 compounds (Figure S3 in Add- 
itional file 3). However, there was a better correlation 
between both classification methods for compounds 
with strong biomarkers of response (upper third; Spear- 
man correlation coefficient 0.84) and compounds without 
a clear signal associated with drug response (lower 
third; Spearman correlation coefficient 0.46). This sug- 
gests that for compounds with strong biomarkers, a 
signature can be identified by either approach. For 
compounds with a weaker signal of drug response 
(middle third), there was a larger discrepancy in per- 
formance between both classification methods (Spear- 
man correlation coefficient 0.16), with neither of them 
outperforming the other. 

Thirteen of the 51 compounds (25%) showed a strong 
transcriptional subtype-specific response (AUC >0.70), with 
the best omics signature not adding predictive information 
beyond a simple transcriptional subtype-based prediction 
(AUC increase below 0.1) (Figure 2; Additional file 5). This 
suggests that the use of transcriptional subtype alone could 
greatly improve prediction of response for a substantial 
fraction of agents [8], as is already done for the estro- 
gen receptor (ER), ERBB2 receptor, and selective use of 
chemotherapy in breast cancer subtypes. This is con- 
sistent with our earlier report that molecular pathway 
activity varies between transcriptional subtypes [8]. 
However, deeper molecular profiling added significant 
predictive information about probable response for the 
majority of compounds (33/51 = 65%) with an increase 
in AUC of at least 0.1 beyond subtype alone. Mutation 
status of the seven genes introduced above was in general 
not more predictive than any other dataset, with the 
exception of tamoxifen and CGC-11144. For tamoxifen 
response, prediction based on mutation status was sub- 
stantially better than subtype, driven predominantly by 
the higher mutation prevalence of PIK3CA mutations 
in luminal compared to basal breast cancer and there- 
fore an association of PIK3CA mutation with lack of 
response [4]. For CGC-11144, the mutation-based AUC 
was 0.70, primarily driven by TP53 and much higher 
than obtained with the best performing molecular data 
set (methylation, AUC 0.42). 

In vivo validation of the cell line-derived response signatures 

We validated in vitro signatures for expression profiles 
from tumor samples with response information, in addition 



Daemen et al. Genome Biology 2013, 14:R1 10 
http://genomebiology.com/201 3/1 4/1 0/R1 1 0 



Page 4 of 14 



Training 





B)GI50data: 
70 lines x 138 drugs 



A) Breast cancer 
cell panel: 84 lines 




C) Molecular profiling: 
7 data types 



Testing 

E) TCGA data: 306 tumors 



CNV ^ 

Exp 
Meth 




D) 90 drug response 
signatures per data 
type combination 



Treatment 


\[<.<l,-l 


\r< ' 


Data 


WPMV 


Probability 




MLX4924 


LSSVM 




EA-Meth 


'VI 


0.930 


• 


Oxamflatin 


LSSVM 


0.610 


U133A 


LOO 


0.930 


+ 


5-F<IUR 


LSSVM 


o.f.r,.-, 


U133A 


100 


0.920 




I'l)!lSn.-,!)(I.( ' Lahs) 


RF 


0.742 


U133A | 100 






Geldaiiaiiiycin 


LSSVM 




U133A-M.-lli 


100 


0.847 


+ 


BIBW2992 


LSSVM 


0.919 


SXP 


100 


0.726 


+ 


SB-715992 


RF 


().(i. r )i 


SNP 


100 


0.656 


+ 



F) Treatment response 
predictions for 22 drugs 



Figure 1 Cell line-based response prediction strategy. (A) We assembled a collection of 84 breast cancer cell lines composed of 35 luminal, 
27 basal, 10 claudin-low, 7 normal-like, 2 matched normal and 3 of unknown subtype. Fourteen luminal and 7 basal cell lines were also 
ERBB2-amplified. (B) Seventy lines were tested for response to 138 compounds by growth inhibition assays. Compounds with low variation in 
response in the cell line panel were eliminated, leaving a response data set of 90 compounds. Cell lines were divided into a sensitive and 
resistant group for each compound using the mean Gl 50 value for that compound. (C) Seven pretreatment molecular profiling data sets were 
analyzed to identify molecular features associated with response. Exome-seq data were available for 75 cell lines, followed by SNP6 data for 74 
cell lines, RNAseq for 56, exon array for 56, RPPA for 49, methylation for 47, and U133A expression array data for 46 cell lines. All 70 lines were 
used in development of at least some predictors depending on data type availability. (D) Classification signatures were developed using the 
molecular feature data (after filtering) and with response status as the target. Two methods, weighted least squares support vector machine 
(LS-SVM) and random forests (RF), were utilized. The best performing signature was chosen for each drug and data type combination. This allows 
prediction of response for additional cell lines or tumors with any given combination of input data types. (E) Cell line-based response predictors 
were applied to 306 TCGA breast tumors for which expression (Exp), copy number (CNV) and methylation (Meth) measurements were all 
available. (F) This identified 22 compounds with a model AUC >0.7 for which at least some patients were predicted to be responsive with a 
probability >0.65. Thresholds for considering a tumor responsive were objectively chosen for each compound from the distribution of predicted 
probabilities and each patient was assigned to a status of resistant, intermediate or sensitive. WPMV, weighted percent of model variables. 



to an assessment of cell line signal in tumor samples (Sup- 
plementary Results in Additional file 3). Such independent 
information was available for tamoxifen [17-20] and the 
histone deacetylase inhibitor valproic acid [21]. The inde- 
pendent tamoxifen data are from a meta-analysis where 
relapse-free survival status was available for 439 ER- 
positive patients [17-20]. Our in vitro 174-gene signature 
for tamoxifen, built on the complete panel of cell lines 



regardless of ER status, predicted a significantly improved 
relapse-free survival for patients predicted to be tamoxifen- 
sensitive (log- rank test; P- value 0.02; Figure 3). For valproic 
acid, therapeutic responses were tested for 13 tumor 
samples grown in three-dimensional cultures [21]. Our 
in vitro 150-gene signature for the histone deacetylase 
inhibitor vorinostat (Figure S4a in Additional file 3) distin- 
guished valproic acid responders from non-responders 
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(See figure on previous page.) 

Figure 2 Comparison of transcriptional subtype and molecular profiling for 51 (57%) of the compounds with predicted compound 
response within the cell lines with high estimated accuracy (AUC >0.70). AUC obtained with transcriptional subtype is shown in gray. 
Compounds are ordered based on increase in AUC from subtype to the best performing molecular data. The increase in AUC with respect to 
subtype obtained with the best performing molecular data is shown in cyan. For 65% of the compounds, molecular profiling performed 
substantially better than subtype, with an AUC increase of at least 0.1 (compounds above the red dashed line). Subtype was sufficient for 25% of 
the compounds with AUC >0.70 and AUC increase obtained with molecular profiling less than 0.1 (compounds below the red dashed line with 
subtype AUC above the blue solid line). 



(AUC = 0.97), with 7/8 sensitive samples (87.5%) and 4/5 
resistant samples (80%) classified correctly when using a 
probability threshold of 0.5 for response dichotomization 
(Figure S4b in Additional file 3). 

Unfortunately, omic profiles and corresponding clinical 
responses are not available for the other compounds tested 
in vitro. For these, we investigated whether the in vitro pre- 
dictive signature was present in 536 breast TCGA tumors 
and consistent with the signature observed in cell lines. 
Here, we limited our analyses to those data types that are 
available in the TCGA dataset. Specifically, we developed 
response predictors for the breast cancer cell line panel 
using profiles for expression (U133A, exon array at the 
gene level, or RNAseq at the gene level), copy number, 
and promoter methylation for 51 compounds for which 
predictive power was high (AUC >0.7; Additional file 5). 
We applied these signatures to a set of 369 luminal, 95 
basal, 8 claudin-low, and 58 ERBB2-amplified samples 
from the TCGA project. We used profiles of expression 
(n = 536), copy number (n = 306) and promoter methy- 
lation (n = 318) in our analyses. Additional file 5 shows 
that the transcriptional subtype specificities measured 
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Figure 3 Validation of the cell line signature for tamoxifen in a 
meta-set of 439 breast cancer patients treated with tamoxifen. 

Kaplan-Meier plot of relapse free survival for patients predicted to be 
sensitive versus resistant to tamoxifen according to the 174-gene 
cell line-based predictor. 



for these compounds in the cell lines were concordant 
with the subtype of TCGA samples predicted to re- 
spond. Figure S5 in Additional file 3 shows the pre- 
dicted probability of response to four compounds with 
test AUC >0.7 for TCGA tumor samples ordered ac- 
cording to increasing probability. Importantly, genes in 
these signatures that were coordinately regulated in the 
set of cell lines were also coordinately regulated in the 
tumor samples (average Jaccard coefficient = 0.68, P- 
value <0.0001; Figure S6 in Additional file 3). This panel 
of 51 compounds represented most major therapeutic 
target classes (phosphatidylinositol 3-kinase (PI3K), re- 
ceptor tyrosine kinase, anti-mitotic, DNA damage, cell 
cycle, proteasome, anti-metabolite, TP53, mitogen- 
activated protein kinase (MAPK), and estrogen antagon- 
ist). Eighteen of these compounds have been approved by 
the US Food and Drug Administration, including five for 
breast cancer. Phase I clinical trials are ongoing for seven 
compounds, phase II trials are underway for seven com- 
pounds, including six for breast cancer, and one com- 
pound is currently being tested in a phase III trial 
(Additional file 5). Thus further validation of signatures 
may be possible in the near future. 

Robust predictors of drug response are found at all levels 
of the genome 

With seven data types available on a single set of samples, 
we were well-positioned to assess whether particular tech- 
nologies or molecular data types consistently out-perform 
others in the prediction of drug sensitivity. To obtain a 
ranking of the importance of the molecular datasets, we 
compared prediction performance of classifiers built on in- 
dividual data sets and their combination for 29 common 
cell lines (with the exclusion of the normal-like cell lines). 
Importantly, no single data type performed well for all com- 
pounds, with each data type performing best for some com- 
pounds (Figure S7 in Additional file 3). Table S6a,c in 
Additional file 3 shows the ranking of the datasets accord- 
ing to the independent classifiers obtained with LS-SVM 
and RF, respectively. For the LS-SVM classifiers, RNAseq 
performed best for 22 compounds, exon array for 20 
compounds, SNP6 for 18, U133A for 17 and methylation 
data for 12 compounds (Table S6a in Additional file 3). 
Similar results were confirmed with the RF approach 
(Table S6c in Additional file 3). Even though it had varying 
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performance for individual compounds, in general, RNAseq 
significantly outperformed all other data types across 
the complete panel of 90 compounds (paired £-test with 
multiple testing correction; P-values ranging from 0.026 
to 2e-8; Figure 4). SNP6 copy number data resulted in 
significantly worse predictive power compared to all other 
data types (P-values 0.038 to 2e-8). In addition, exon array 
outperformed U133A, with a P-value of 0.0002. 

In Table S6b,d in Additional file 3, a distinction is made 
between two groups of compounds: compounds for which 
all datasets perform similarly well (for example, CGC-11047, 
GSK461364, GSK2126458, lapatinib) versus compounds 
for which results with one dataset are much better than 
obtained with any of the other datasets, defined as an 
AUC increase of at least 0.1. For example, exon array 
worked best for VX-680 (AUC 0.81), RNAseq for carbopla- 
tin (AUC 0.89), and RPPA for bortezomib (AUC 0.87). Data 
type specificity was in general not related to therapeutic 
compound class, although there were a few exceptions for 
LS-SVM with RNAseq performing well for polyamine an- 
alogs (CGC-11047, CGC-11144) and mitotic inhibitors 
(ixabepilone, paclitaxel, vinorelbine), SNP6 for ERBB2/ 
epidermal growth factor receptor (EGFR) inhibitors 
(AG 1478, BIBW2992, erlotinib, gefitinib, lapatinib), and 
methylation for CDK1 inhibitors (NU6102, purvalanol A). 



The full combination of genome-wide datasets yielded a 
higher AUC value than the best performing individual 
dataset for only a limited number of compounds (AKT1-2 
inhibitor, GSK461364 and PF-4691502). The full combin- 
ation signatures, however, generally ranked closely to 
the best signatures based on individual data types. We 
refer to the 'Robust predictors of drug response' section in 
Supplementary Results in Additional file 3 for two additional 
complementary analyses on dataset comparison. 

Splice-specific predictors provide only minimal 
information 

We compared the performance of classifiers between the 
fully featured data and gene-level data in order to inves- 
tigate the contribution of splice-specific predictors for 
RNAseq and exon array data. The fully featured data in- 
cluded transcript- and exon-level estimates for the exon 
array data and transcript-, exon-, junction, boundary-, 
and intron-level estimates for the RNAseq data. Overall, 
there was no increase in performance for classifiers built 
with 'splice- aware' data versus gene level only. The over- 
all difference in AUC from all features minus gene-level 
was 0.002 for RNAseq and -0.006 for exon array, a negli- 
gible difference in both cases. However, there were a few 
individual compounds with a modest increase in 



o 

3 



1.0 



0.9 



0.8 



0.7 



| 06 

Q. 

o 

0.5 
0.4 H 
0.3 




■ v 



if) 
< 



Q_ 



< 



■ RF 

• LSSVM 



Figure 4 Boxplot of best AUC values for all 90 compounds across 6 data types. For all data types, the highest AUC obtained with either 
approach (LS-SVM in red circles, or random forest in blue squares) is displayed. For RNAseq and exon array, the highest AUC is shown among 
models built on gene-level data only versus all features (exons, junctions, and so on). The one-way repeated measures ANOVA test revealed a 
significant difference in performance among any of the data types (P-value 2.6e-5). Post hoc pairwise comparisons with multiple testing correction 
revealed a significant outperformance of RNAseq with respect to all other data types. SNP6 copy number performed significantly worse 
compared to all other data types, and exon array additionally significantly outperformed U133A. 
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performance when considering splicing information 
(Table S8 in Additional file 3). Interestingly, both ERBB2 
targeting compounds, BIBW2992 and lapatinib, showed 
improved performance using splice-aware features in 
both RNAseq and exon array datasets. This suggests that 
splice-aware predictors may perform better for predic- 
tion of ERBB2 amplification and response to compounds 
that target it. However, the overall result suggests that 
prediction of response does not benefit greatly from spli- 
cing information over gene-level estimates of expression. 
This indicates that the high performance of RNAseq for 
discrimination may have more to do with that technol- 
ogy's improved sensitivity and dynamic range, rather 
than its ability to detect splicing patterns. 

Pathway overrepresentation analysis aids in 
interpretation of the response signatures 

We surveyed the pathways and biological processes 
represented by genes for the 49 best-performing 
therapeutic response signatures incorporating copy 
number, methylation, transcription, and/or proteomic 
features (i.e., no mutation status) with AUC >0.7 (Add- 
itional file 5). For these compounds we created func- 
tionally organized networks with the ClueGO plugin in 
Cytoscape [22] using Gene Ontology (GO) categories 
and Kyoto Encyclopedia of Genes and Genomes 
(KEGG)/BioCarta pathways (Supplementary Methods in 
Additional file 3). Our previous work identified tran- 
scriptional networks associated with response to many 
of these compounds [8]. In this study, 5 to 100% (median 
79%) of GO categories and pathways present in the pre- 
dictive signatures were found to be significantly associ- 
ated with drug response (FDR P-value <0.05). The 
majority of these significant pathways, however, were 
also associated with transcriptional subtype (17 to 
100%, median 70%). These were filtered out to capture 
subtype-independent biology underlying each compounds 
mechanism of action. The resulting non-subtype-specific 
pathways with FDR P-value <0.05 are shown in Additional 
file 6. Eighty-eight percent of the compounds for which 
we conducted pathway analysis were significantly asso- 
ciated with one or more GO category and 80% were sig- 
nificantly associated with one or more KEGG pathway. 
The most commonly identified KEGG pathways (six or 
more compounds) were hedgehog signaling, basal cell 
carcinoma, glycosphingolipid biosynthesis, ribosome, 
spliceosome and Wnt signaling. The most commonly 
identified GO processes (six or more compounds) also in- 
cluded many critical cancer pathways and processes, such 
as regulation of cell cycle, cell death, protein kinase activity, 
metabolism, TGF(3 receptor signaling, cell-cell adhesion, 
microtubule polymerization, and Wnt receptor signaling. 
Many of these processes can be linked directly to the known 
mechanisms of action of their associated compounds. For 



example, the signature for docetaxel was significantly 
enriched for microtubule polymerization genes. Docetaxel is 
known to function by microtubule disassembly inhibition. 
Similarly, signatures for the AKT1/2 kinase inhibitor, 
bosutinib SRC kinase inhibitor, TCS PIM-11 kinase in- 
hibitor and four PI3K inhibitors (GSK21 19563, 
GSK2126458, PF-4691502, TGX-221) were all enriched in 
genes involved in the negative regulation of protein kinase 
activity. These kinase regulation genes tended to be consist- 
ently up-regulated or both methylated and down-regulated, 
depending on the therapeutic response signature. Many of 
the genes in this enriched gene set have well-described roles 
in modulation of the PI3K/MAPK cascades, including 
ERRFI1 [23], DUSP6/7/8 [24] and SPRY1/2/4 [25]. In par- 
ticular, we found that high expression of GADD45A was 
associated with resistance to GSK2126458, PF-4691502 and 
the AKT1/2 inhibitor, which is consistent with the observa- 
tion that AKT inhibition modulates cell growth via activa- 
tion of GADD45A [26]. The pan-PI3K targeting agent 
GSK2126458 is reported to function as a competitive ATP 
binding inhibitor and the signature for this compound was 
over-represented in ATP metabolic processes [27]. 

Genomic aberrations and transcriptomic/proteomic 
features played prominent roles in some of the candidate 
response signatures. For copy number aberrations, ERBB2 
amplification was strongly associated with response to the 
ERBB2 targeting compounds lapatinib (two-sample £-test, 
P-value 2.1e-ll) and BIBW2992 (1.6e-5) and to EGFR in- 
hibitors AG1478 (2.5e-4) and gefitinib (9.5e-4). In addition 
to the association of overall mutation status with tamoxifen 
and CGC-11144 response discussed above, we also found 
several individual mutations to be relevant for treatment 
response. The presence of mutations in TP53 was strongly 
associated with response to the PI3K inhibitor BEZ235, with 
13/25 (52%) of the sensitive cell lines harboring TP53 muta- 
tions compared to 3/19 (16%) for the resistant cell lines 
(Fishers exact test, P-value 0.025). This may be an indica- 
tion of synthetic lethality resulting from BEZ235 inhibition 
of ATR (Ataxia telangiectasia and Rad3-related protein) 
leading to replicative stress in TP53-deficient cells [28]. 
Kim et al. [29] showed a similar trend in a study of 310 
cell lines across multiple lineages in which co-mutation of 
TP53 and PIK3CA was positively associated with response 
to BEZ235. In our study, mutation status for PIK3CA 
was associated with response to the PI3K inhibitor 
GSK1059615B, with 11/27 (41%) sensitive cell lines 
carrying PIK3CA mutations compared to 2/21 (10%) 
for resistant cell lines (P-value 0.022). These findings 
are consistent with recent clinical observations in pa- 
tients with breast and gynecologic malignancies where 
treatment with similar agents resulted in response for 
30% of patients with PIK3CA mutations compared to a 
response rate of 10% in wild-type PIK3CA patients 
[30]. 
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Response signature Toolbox to predict response in 
individual tumors 

Our long-term goal is to develop a way to select therapeutic 
compounds most likely to be effective in an individual pa- 
tient. A shorter-term goal is to test experimental com- 
pounds in patients that are most likely to be responsive. 
Both of these goals require a strategy to order compounds 
according to their predicted relative efficacy for individual 
patients. To this end, we developed software to rank order 
compounds for predicted efficacy in individual patients (see 
the 'Patient response prediction toolbox in R' section in 
Supplementary Results in Additional file 3). The software 
applies signatures of response developed in vitro to mea- 
surements of expression, copy number, and/or methylation 
for individual samples and produces a list of recommended 
treatments ranked according to predicted probability of re- 
sponse and in vitro GI 50 dynamic range. For cases where 
several compounds are predicted to be equally effective, 
highest priority is assigned to the compound with high- 
est GI 50 dynamic range in the cell line panel. 

Given the concordance of the predictive signatures for 
the 51 compounds in gene expression and subtype asso- 
ciation between the cell lines and tumor samples from 
TCGA, we applied our in vitro response predictors to the 
306 sample subset for which expression, copy number and 
methylation measurements were all available. This identi- 
fied 22 compounds with a model AUC >0.7 for which at 
least some patients were predicted to be responsive with a 
probability >0.65. In all cases, thresholds for considering a 
tumor responsive were objectively chosen for each com- 
pound from the distribution of predicted probabilities and 
each patient was assigned to a status of resistant, intermedi- 
ate or sensitive (see the 'Patient response prediction toolbox 
in R' section in Supplementary Results in Additional file 3, 
and Table S10 in Additional file 3). The resulting pattern of 
predicted sensitivity for the 22 compounds is displayed in 
Figure 5. Most of the compounds were predicted to have 
strong transcriptional subtype specificity (P-values 1.5e-70 
to 0.02) although gefitinib and NU6102 were exceptions 
(Table S9 in Additional file 3). Not surprisingly, predicted 
sensitivity to lapatinib, BIBW2992 and to a lesser extent 
EGFR inhibitors was highly specific to ERBB2+ patients. 
Similarly, ER+ (and also many ERBB2+) patients were 
more frequently predicted to be sensitive to the PI3K 
inhibitors, AKT inhibitors, tamoxifen and to a lesser extent 
fluorouracil (5-FU). Patients in the basal (ER-/ERBB2-) sub- 
type were predicted to be sensitive to cisplatin, PLK inhibi- 
tor (GSK461364A), bortezomib, gamma-secretase inhibitor 
(PF-3084014), paclitaxel and Nutlin 3A. The percentage of 
patients predicted to respond to any given compound 
ranged from 15.7% for BIBW2992 to 43.8% for the PI3K 
alpha inhibitor GSK21 19563. Nearly all patients (99.3%) 
were predicted to respond to at least one treatment and each 
patient was predicted to be sensitive to an average of 



approximately six treatments. The predicted response rate to 
5-FU was estimated at 23.9% (Figure S8 in Additional file 3), 
in agreement with the observed response rates to 5-FU as 
monotherapy in breast cancer (17% [21] to 26% [31,32]). 
The compound response signatures for the 22 compounds 
featured in Figure 5 are presented in Additional file 7. 

Conclusions 

In this study we developed strategies to identify molecu- 
lar response signatures for 90 compounds based on mea- 
sured responses in a panel of 70 breast cancer cell lines, 
and we assessed the predictive strengths of several strat- 
egies. The molecular features comprising the high quality 
signatures are candidate molecular markers of response 
that we suggest for clinical evaluation. In most cases, the 
signatures with high predictive power in the cell line panel 
show significant PAM50 subtype specificity, suggesting 
that assigning compounds in clinical trials according 
to transcriptional subtype will increase the frequency 
of responding patients. However, our findings suggest 
that treatment decisions could further be improved for 
most compounds using specifically developed response 
signatures based on profiling at multiple omic levels, 
independent of - or in addition to - the previously de- 
fined transcriptional subtypes. We make available the 
drug response data and molecular profiling data from 
seven different platforms for the entire cell line panel as a 
resource for the community to aid in improving methods 
of drug response prediction. 

We found predictive signatures of response across all 
platforms and levels of the genome. When restricting 
the analysis to just 55 well-known cancer proteins and 
phosphoprotein genes, all platforms do a reasonable job 
of measuring a signal associated with and predictive of drug 
response. This indicates that if a compound has a molecu- 
lar signature that correlates with response, it is likely that 
many of the molecular data types will be able to measure 
this signature in some way. Furthermore, there was no sub- 
stantial advantage of the combined platforms compared 
with the individual platforms. Some platforms might be 
able to measure the signature with slightly better accuracy, 
but our results indicate that many of the platforms could 
be optimized to identify a response-associated predictor. 

Conversely, in the genome-wide comparison, the more 
comprehensive platforms are the ones that overall re- 
sulted in better prediction performance. This difference 
may reflect the fact that for those platforms, we selected 
the most significant feature per gene. For example, when a 
gene measured on the Affymetrix microarray is significantly 
differentially expressed, the chance is high that a particular 
exon or transcript is even more significant. Thus, the rich- 
ness of data types like RNAseq offer the chance to identify 
both the signature and the most useful specific gene regions 
and junctions for use in a diagnostic (Figure 4). Taken 



Daemen et al. Genome Biology 2013, 14:R1 10 
http://genomebiology.com/201 3/1 4/1 0/R1 1 0 



Page 10 of 14 




TCGA Breast Tumor Drug Response Predictions 
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Figure 5 Heatmap representation of predicted sensitivity in the TCGA population. This heatmap represents the predicted pattern of 
sensitivity in 306 TCGA patients with expression, methylation and copy number data for 22 compounds with model AUC >0.7 and with at least 
some patients predicted to respond with probability >0.65. For the 306 patients, association is shown with ER, progesterone receptor (PR), ERBB2 
status, tumor size (T), lymph node involvement (N), distant metastasis (M), and subtype. Of the 22 therapeutic compounds, 5 are 
chemotherapeutic and 17 are targeted agents. Probability values shown were re-scaled to reflect custom sensitivity thresholds as described in 
Supplementary Methods of Additional file 3. Res., resistant; Int., intermediate; Sens., sensitive. 



together, these results suggest that the more comprehensive 
genome-wide platforms could be used for discovery, and 
once identified, significant features can be migrated to alter- 
native platforms for a lab diagnostic. 

Currently, treatment decisions are guided by ER and 
ERBB2 status. Using the TCGA dataset of 306 samples with 
expression, copy number and methylation measurements 
as a hypothetical example (Figure 5), a personalized 
treatment decision would be available for 81% of pa- 
tients based on ERBB2 or ER status alone (55 ERBB2+, 
193 ERBB2-/ER+). However, given reported response 
rates for trastuzumab (15 to 50%) [33] and tamoxifen 
(approximately 25%) [34] we can expect a substantial 
fraction of these will not respond. The candidate pre- 
dictors proposed here could inform such clinical deci- 
sions for nearly all patients. Therefore, by considering 
diverse molecular data, we might suggest treatment options 
for not only the approximately 20% of patients who are 



ERBB2-/ER- but also secondary treatment options for 
those who will suboptimally respond to ER or ERBB2 
directed treatments. 

While our efforts to develop predictive drug response 
signatures are quite promising, they come with several 
conceptual caveats. Although the cell line panel is a 
reasonable model system, it does not capture several 
features known to be of critical importance in primary 
tumors. In particular, we have not modeled influences 
of the microenvironment, including additional cell 
types known to contribute to tumorigenesis [35], as 
well as variation in oxygen content, which has been 
shown to influence therapeutic response [36]. Expanding 
these experiments to three-dimensional model systems or 
mouse xenografts would aid in translation to the clinic. 
Additionally, validating these predictors in independent 
data sets will be important for determining how robust 
they are (see Supplementary Results and Additional file 8). 
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Despite these limitations, our observation that we could 
find evidence of these predictive signatures in the TCGA 
data suggests that our cell line system is likely captur- 
ing many of the key elements involved in mediating 
therapeutic response. 

Of course, the cell line-derived predictive signatures 
described in this study require substantial clinical val- 
idation. One possibility is in neoadjuvant trials like the 
I-SPY 2 TRIAL [37], in which in vitro-denved signatures 
for individual compounds are tested for power in predicting 
pathologic complete response or change in tumor volume 
measured with magnetic resonance imaging. An alternative 
approach for validation of signatures for approved drugs 
is to compare outcomes in patients assigned compounds 
according to in vitro predictors with outcomes in patients 
assigned drugs according to physicians' first treatment 
choice. This study constitutes the basis for such a trial, 
with the development of a portfolio of in vitro predictors 
(for example, the 22 compounds displayed in Figure 5) and 
a computational tool that physicians might use to select 
compounds from that portfolio for individual patients. 

Regardless of the specific design of the clinical trial, 
gene expression, methylation and copy number levels 
should be collected for all patients. High throughput 
sequencing techniques can provide all three with the 
additional benefits of alternative splicing information. 
As outlined in Figure 1, measurements of expression, 
methylation and copy number would serve as input to the 
predictor toolbox. The output of the toolbox consists of a 
report for each individualized patient, with the 22 thera- 
peutic compounds ranked according to a patients likeli- 
hood of response and in vitro GI 50 dynamic range. The 
full panel of 22 drug compounds could be tested simultan- 
eously in a multi-arm trial to speed up the validation of 
the in vitro approach. The proposed clinical trial may also 
involve further optimizing of the number of markers in 
the signatures and choosing clinically relevant thresholds 
for tumor classification. 

Materials and methods 

We refer to Supplementary Methods in Additional file 3 
for a detailed description of the therapeutic compound 
response data, molecular data for the breast cancer cell 
lines, molecular data for the external breast cancer tumor 
samples used for validation, classification methods, data 
integration approach, statistical methods, pathway overrep- 
resentation analysis, and the patient response prediction 
toolbox for the R project for statistical computing. 

Data and code deposition 

Genome copy number data have been deposited at the 
European Genome-phenome Archive (EGA) [38], hosted 
at the EBI (accession numbers EGAS00000000059 and 
EGAS00001000585). Gene expression data for the cell 



lines were derived from Affymetrix GeneChip Human 
Genome U133A and Affymetrix GeneChip Human Exon 
1.0 ST arrays. Raw data are available in ArrayExpress 
[39], hosted at the EBI (accession numbers E-TABM-157 
and E-MTAB-181). RNAseq and exome-seq data can be 
accessed at the GEO, [40], accession number GSE48216. 
Genome-wide methylation data for the cell lines are also 
available through GEO, accession number GSE42944. 
Software and data for treatment response prediction are 
available on Synapse [41]. The software has also been 
deposited at GitHub [42]. The raw drug response data 
are available as Additional file 9. 

Additional files 



Additional file 1: Table SI. Overview of 84 cell lines with subtype 
information and available data. Gl 50 values for 90 therapeutic compounds 
are provided for 70/84 cell lines included in all analyses. 

Additional file 2: Table S2. Processed Reverse Protein Lysate Array 
(RPPA) intensity data for 70 (phospho)proteins with fully validated 
antibodies in 49 cell lines. See Supplementary Methods in Additional file 
3 for data processing details. 

Additional file 3: Supplementary Methods, Supplementary Results, 
Figures S1 to S10, and Tables S4, S6, S8, S9, S10, SI 2, and SI 3. 
Supplementary Methods: detailed description of the therapeutic 
compound response data, molecular data for the breast cancer cell lines, 
molecular data for the external breast cancer tumor samples used for 
validation, classification methods, data integration approach, statistical 
methods, and pathway overrepresentation analysis. Supplementary 
Results: assessment of cell line signal in tumor samples, inter-data 
relationships, prediction comparison of datasets, validation against other 
cell line datasets, and the patient response prediction toolbox for the R 
project for statistical computing. Table S4: overview of genes with good 
correlation (FDR P-value <0.05) between SNP6 and gene expression; 22 
to 39% of genes in copy number aberration regions show a significant 
concordance between their genomic and transcriptomic profile after 
multiple testing correction. Table S6: data type ranking of the 
importance of the molecular datasets by comparison of prediction 
performance of LS-SVM and RF classifiers built on individual data sets and 
their combination, and by comparison of the average appearance of data 
types in the top 100 of ranked features, with and without inclusion of 
RPPA data. Examples are also provided of compounds for which (most) 
datasets give similar results or for which one dataset performs better 
(shown in bold). Table S8: performance for 'splice-specific' response 
predictors (RF) with an AUC increase >0.05 when comparing all transcript 
features to gene-level values alone. Table S9: statistical association 
between clinical variables and predicted response for 306 TCGA patients 
with expression, methylation and copy number data available. For each 
compound, the best performing model was utilized (LS-SVM or RF with 
any combination of expression, copy number and methylation data). 
Table S10: resistant/intermediate/sensitive cutoffs for 22 compounds 
with model AUC >0.7 and at least one patient with probability of 
response >0.65. Cutoff value 1 separates patients considered resistant 
from intermediate. Cutoff value 2 separates patients considered 
intermediate from sensitive. The percentage value for each group 
indicates the percentage of total patients (n = 306) in each group. 
Table S12: presence and variance of filtered features from U133A and 
exon array cell line data in tumor samples. Features from U133A and the 
exon array that passed the variance and presence filter in the cell lines 
were present in the majority of breast cancer tumor samples. Table S13: 
summary of 167 predictors in random forests classifier for lapatinib 
(all data types, optimal predictor number). Figure SI: data summary in 
terms of number of features before and after data-type-specific reduction 
and unsupervised filtering based on variance and signal detection above 
background. Figure S2: overview of the mutation prevalence in the cell 
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line panel and TCGA data set for the list of seven common coding 
variants detected by TCGA, with a distinction between luminal, basal and 
ERBB2-enriched. Cell lines with unknown subtype are displayed in 
orange. To make the subtypes comparable, luminal A and B were 
grouped into luminal for the TCGA data set, whilst basal and claudin-low 
cell lines were grouped into basal. The mutation rate in TCGA and the 
cell line panel shows a similar distribution across the subtypes. 
Figure S3: comparison of the best LS-SVM and RF models for the 90 
compounds, sorted according to highest AUC obtained with either 
model. Figure S4: validation of the cell line signature for vorinostat in 
tumor samples grown in three dimensions: heatmap of the 150-gene 
signature for vorinostat in the cell line panel and 13 tumor samples 
treated with valproic acid. Seven out of eight sensitive samples (87.5%) 
and four out of five resistant samples (80%) are classified correctly with a 
probability threshold of 0.5 for response dichotomization. Figure S5: 
predicted probability of response of TCGA tumor samples to compounds 
lapatinib, sigma AKT1-2 inhibitor, GSK21 26458 and docetaxel. The TCGA 
tumor samples are ordered according to increasing probability of response. 
Figure S6: correlation-based coherence heatmap for two cell line-derived 
gene signatures: coherence among 67 genes of the U133A signature for the 
sigma AKT1-2 inhibitor in the cell lines (left) and TCGA tumor samples (right) 
(Jaccard coefficient = 0.85; P-value <0.0001); coherence among 109 genes of 
the RNAseq signature for everolimus in the cell lines (left) and TCGA tumor 
samples (right) (Jaccard coefficient = 0.79; P-value <0.0001). Figure S7: 
comparison of the best model per dataset for the 90 compounds, sorted 
according to highest AUC obtained with either model (LS-SVM or RF). For 
RNAseq and exon array, the highest AUC is shown among models built on 
gene-level data only or all features (exons, junctions, and so on). Figure S8: 
distributions of response probabilities for 5-FU determined by mixed 
model clustering and used for cutoff selection. With a cutoff of 0.74, 
23.9% of TCGA tumor samples were predicted to respond to 5-FU 
(Table S10 in Additional file 3). Figure S9: association between 
response to lapatinib and ERBB2 status, response to BIBW2992 and 
ERBB2 status, and response to tamoxifen and ER status for 306 TCGA 
patients with expression, methylation and copy number data 
available. Figure S10: heatmap of the 167 highest ranked features for 
lapatinib, obtained with RF applied to the full set of molecular data. 

Additional file 4: Table S3. Gl 50 dichotomization threshold for each 
compound, defined as the mean Gl 50 for the 48 core cell lines. 

Additional file 5: Table S5. Overview of the best LS-SVM/RF model for 
all 90 therapeutic compounds with comparison to the LS-SVM AUC based 
on subtype and ERBB2 status. For the subset of 51 therapeutic compounds 
with test AUC exceeding 0.7, additional information is provided on clinical 
trial status, comparison of Gl 50 with TGI, validation results of the cell line 
signal in the TCGA tumor samples, and most significant non-subtype related 
KEGG/BioCarta pathways from Additional file 6. 

Additional file 6: Table S7. List of significant non-subtype specific GO 
categories and KEGG/BioCarta pathways with FDR P-value <0.05. Per 
category/pathway information includes FDR P-value and the number of 
signature genes, percentage of signature genes and list of signature genes 
that are part of this category/pathway. Significant pathways associated with 
both drug response and transcriptional subtype were excluded, to capture 
biology underlying each compound's mechanism of action. 

Additional file 7: Table S1 1. Compound response signatures for the 
22 compounds featured in Figure 5 with model AUC >0.7 and at least 
one patient from the TCGA set of 306 tumor samples with expression, 
copy number and methylation data available with probability of 
response >0.65. 

Additional file 8: Table S14. Validation results for six drugs 
(BIBW2992, lapatinib, rapamycin, GSK21 26458, gefitinib and 
GSK2141795) in 11 HER2+ lines. 

Additional file 9: Raw drug response data. Raw drug response data 
used to compute GI50 values used in this study. The columns represent 
the following: cellline = cell line lineage; compound = compound 
tested; drug_plate_id = unique identifier for the plate of 3 compounds; 
T0_plate_id = unique identifier for the time 0 h control plate associated 
with the drug plate; background_od1, background_od2 = background 
od values (for correction of background luminesence); od0.1, od0.2, 



od0.3 = triplicate measures for untreated cells; odl.1, odl.2, odl.3... 
od9.1, od9.2, od9.3 = triplicate measures of number of cells alive 
after treatment with lowest to highest drug; T0_background_od1, 
T0_background_od2 = background od values (for correction of 
background luminesence); T0_median_od = median od at TO; 
c1 to c9 = drug concentrations tested; units = units of drug 
concentration tested. 
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